Advanced Data Science Innovation - Assignment 1

NBA Career Prediction - Predicting the 5-Year Career Longevity for NBA Rookies

Student: Carol Paipa / 90014679 / carol.m.paipa@student.uts.edu.au

Team: Group 1

  • Nuwan Munasinghe
  • Wenyingwuwy
  • Nathan Fragar
  • Sean Williams
  • Carol Myhill

1. Load and Prepare Data

[1.1] Import the required packages

In [189]:
# required python libraries
import pandas as pd
import numpy as np
from joblib import dump
import seaborn as sns
import matplotlib.pyplot as plt
import time

# scikit-learn models and functions
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import accuracy_score

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import mutual_info_regression

# Logistic Regression Models
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import plot_confusion_matrix

import warnings
warnings.filterwarnings('ignore')

[1.2] Load the training and test datasets

[1.3] Perform investigations to understand the training data

In [190]:
# import training & final test data
df_train = pd.read_csv('../data/raw/train.csv')
df_test = pd.read_csv('../data/raw/test.csv')
In [191]:
# quick investigation of the data
pd.set_option("display.max_columns", None)
df_train.head()
Out[191]:
Id GP MIN PTS FGM FGA FG% 3P Made 3PA 3P% FTM FTA FT% OREB DREB REB AST STL BLK TOV TARGET_5Yrs
0 10556 80 24.3 7.8 3.0 6.4 45.7 0.1 0.3 22.6 2.0 2.9 72.1 2.2 2.0 3.8 3.2 1.1 0.2 1.6 1
1 5342 75 21.8 10.5 4.2 7.9 55.1 -0.3 -1.0 34.9 2.4 3.6 67.8 3.6 3.7 6.6 0.7 0.5 0.6 1.4 1
2 5716 85 19.1 4.5 1.9 4.5 42.8 0.4 1.2 34.3 0.4 0.6 75.7 0.6 1.8 2.4 0.8 0.4 0.2 0.6 1
3 13790 63 19.1 8.2 3.5 6.7 52.5 0.3 0.8 23.7 0.9 1.5 66.9 0.8 2.0 3.0 1.8 0.4 0.1 1.9 1
4 5470 63 17.8 3.7 1.7 3.4 50.8 0.5 1.4 13.7 0.2 0.5 54.0 2.4 2.7 4.9 0.4 0.4 0.6 0.7 1
In [6]:
# quick investigation of the data - rows, columns
df_train.shape
Out[6]:
(8000, 21)
In [7]:
# quick investigation of the data - column info
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Id           8000 non-null   int64  
 1   GP           8000 non-null   int64  
 2   MIN          8000 non-null   float64
 3   PTS          8000 non-null   float64
 4   FGM          8000 non-null   float64
 5   FGA          8000 non-null   float64
 6   FG%          8000 non-null   float64
 7   3P Made      8000 non-null   float64
 8   3PA          8000 non-null   float64
 9   3P%          8000 non-null   float64
 10  FTM          8000 non-null   float64
 11  FTA          8000 non-null   float64
 12  FT%          8000 non-null   float64
 13  OREB         8000 non-null   float64
 14  DREB         8000 non-null   float64
 15  REB          8000 non-null   float64
 16  AST          8000 non-null   float64
 17  STL          8000 non-null   float64
 18  BLK          8000 non-null   float64
 19  TOV          8000 non-null   float64
 20  TARGET_5Yrs  8000 non-null   int64  
dtypes: float64(18), int64(3)
memory usage: 1.3 MB
In [192]:
# quick investigation of the data - column statistics
df_train.describe()
Out[192]:
Id GP MIN PTS FGM FGA FG% 3P Made 3PA 3P% FTM FTA FT% OREB DREB REB AST STL BLK TOV TARGET_5Yrs
count 8000.000000 8000.000000 8000.000000 8000.000000 8000.000000 8000.000000 8000.000000 8000.000000 8000.000000 8000.000000 8000.000000 8000.000000 8000.000000 8000.000000 8000.000000 8000.000000 8000.000000 8000.000000 8000.000000 8000.000000 8000.000000
mean 6856.971000 62.777875 18.576663 7.267087 2.807037 6.231213 44.608900 0.264525 0.816563 19.583700 1.392525 1.947787 71.365825 1.077838 2.168500 3.245300 1.624513 0.648688 0.245212 1.257762 0.833625
std 3977.447579 17.118774 8.935263 4.318732 1.693373 3.584559 6.155453 0.384093 1.060964 16.003155 0.926153 1.252352 10.430447 0.785670 1.392224 2.085154 1.355986 0.407626 0.821037 0.723270 0.372440
min 4.000000 -8.000000 2.900000 0.800000 0.300000 0.800000 21.300000 -1.100000 -3.100000 -38.500000 0.000000 0.000000 -13.300000 0.000000 0.200000 0.300000 0.000000 0.000000 -17.900000 0.100000 0.000000
25% 3413.750000 51.000000 12.000000 4.100000 1.600000 3.600000 40.400000 0.000000 0.100000 8.400000 0.700000 1.000000 65.000000 0.500000 1.100000 1.700000 0.700000 0.300000 0.100000 0.700000 1.000000
50% 6787.500000 63.000000 16.800000 6.300000 2.400000 5.400000 44.400000 0.300000 0.800000 19.500000 1.200000 1.700000 71.400000 0.900000 1.900000 2.800000 1.300000 0.600000 0.200000 1.100000 1.000000
75% 10299.250000 74.000000 23.500000 9.500000 3.700000 8.100000 48.700000 0.500000 1.500000 30.600000 1.900000 2.600000 77.500000 1.500000 2.900000 4.300000 2.200000 0.900000 0.400000 1.600000 1.000000
max 13798.000000 123.000000 73.800000 34.200000 13.100000 28.900000 67.200000 1.700000 4.700000 82.100000 8.100000 11.100000 168.900000 5.500000 11.000000 15.900000 12.800000 3.600000 18.900000 5.300000 1.000000
In [193]:
# quick investigation of the data - check for Null/Nan values
print('Any NULL/NaN values?', df_train.isna().values.any())

df_train.isna()
Any NULL/NaN values? False
Out[193]:
Id GP MIN PTS FGM FGA FG% 3P Made 3PA 3P% FTM FTA FT% OREB DREB REB AST STL BLK TOV TARGET_5Yrs
0 False False False False False False False False False False False False False False False False False False False False False
1 False False False False False False False False False False False False False False False False False False False False False
2 False False False False False False False False False False False False False False False False False False False False False
3 False False False False False False False False False False False False False False False False False False False False False
4 False False False False False False False False False False False False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7995 False False False False False False False False False False False False False False False False False False False False False
7996 False False False False False False False False False False False False False False False False False False False False False
7997 False False False False False False False False False False False False False False False False False False False False False
7998 False False False False False False False False False False False False False False False False False False False False False
7999 False False False False False False False False False False False False False False False False False False False False False

8000 rows × 21 columns

2. Exploratory Data Analysis

  1. Understand how the data is distributed
  2. What are the relationships between features
  3. Identify any imbalanced data
  4. Investigate any correlation between features
  5. Investigate any covariance between features

2.1 Data Distribution

Check which features are normally distributed versus skewed. Features with skewed distributions may need transforming in future project stages.
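Beyond eyeballing histograms, pandas can quantify skew directly with DataFrame.skew(); a common rule of thumb is that |skewness| > 1 flags a highly skewed feature. A minimal sketch on synthetic data (the column names here are placeholders, not df_train's):

```python
import numpy as np
import pandas as pd

# synthetic stand-in: one symmetric and one right-skewed column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'symmetric': rng.normal(0, 1, 1000),
    'right_skewed': rng.exponential(1.0, 1000),
})

# sample skewness per column; |skew| > 1 is a common "highly skewed" cutoff
skews = df.skew()
print(skews.sort_values(ascending=False))
print('Highly skewed:', list(skews[skews.abs() > 1].index))
```

On the real data this would be `df_train.skew()`, giving a ranked shortlist of transform candidates for the next stage.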

In [10]:
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (20,10)

def draw_histograms(df, variables, n_rows, n_cols):
    fig=plt.figure()
    for i, var_name in enumerate(variables):
        ax=fig.add_subplot(n_rows,n_cols,i+1)
        df[var_name].hist(bins=10,ax=ax)
        ax.set_title(var_name)
    #fig.tight_layout()  # Improves appearance a bit.
    #fig.set_dpi(150)
    plt.show()

test = df_train 
draw_histograms(test, test.columns, 3, 7)

2.2 Pairwise relationships between features

This might be overkill - 21x21 columns - but I was curious how it would turn out, and if it would show anything interesting visually.

In [11]:
if False:  # disabled: the full 19x19 pairplot is expensive to render
    df_plot = df_train.copy()
    drop_cols = ['Id']
    df_plot.drop(drop_cols, axis=1, inplace=True)
    df_plot.columns = df_plot.columns.str.strip()
    target = df_plot.pop('TARGET_5Yrs')

    ax = sns.pairplot(df_plot)
    plt.title('Pairwise relationships between the features')
    plt.show()

2.3 Check for Imbalanced Target Data

Due to time constraints for this deadline, any imbalance will be addressed in the next stage of this project.

In [19]:
x = df_train.groupby('TARGET_5Yrs').size()

# Plot the counts for TARGET_5Yrs
plt.rcParams["figure.figsize"] = (5,4) 
plt.bar(['0 = No','1 = Yes'], x)
plt.xlabel("TARGET_5Yrs")
plt.ylabel("# count")
plt.show()
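Since imbalance handling is deferred to the next stage, here is a sketch of how it could start: quantify the minority share and reweight classes during fitting with scikit-learn's class_weight='balanced'. The labels below are synthetic, approximating the roughly 83/17 split seen in the target statistics above:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# synthetic labels approximating the ~83/17 target split (not df_train itself)
y = pd.Series([1] * 833 + [0] * 167)

counts = y.value_counts()
print('class counts:\n', counts)
print('minority share: {:.1%}'.format(counts.min() / counts.sum()))

# one common mitigation: weight classes inversely to their frequency
clf = LogisticRegression(class_weight='balanced')
```

Other options for later (resampling, SMOTE, threshold tuning) would build on the same counts.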

2.4 Correlation Matrix between all features

A correlation matrix helps us quickly understand the correlation between each pair of variables. When two independent variables are highly correlated, the result is a problem known as multicollinearity, which can make the results of a regression hard to interpret. One of the easiest ways to detect potential multicollinearity is to inspect the correlation matrix and visually check whether any variables are highly correlated with each other.
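The visual check can also be done programmatically by listing every pair whose absolute correlation exceeds a threshold. A sketch on synthetic data (the column names are borrowed for illustration; on the notebook's data you would pass `df_train.iloc[:, 1:-1]`):

```python
import numpy as np
import pandas as pd

# synthetic data: PTS is almost a linear function of FGM, BLK is independent
rng = np.random.default_rng(1)
a = rng.normal(size=500)
df = pd.DataFrame({
    'FGM': a,
    'PTS': 2.2 * a + rng.normal(scale=0.1, size=500),
    'BLK': rng.normal(size=500),
})

corr = df.corr().abs()
# keep the upper triangle only, so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs > 0.9])
```

This prints only the FGM/PTS pair, the kind of near-duplicate feature that multicollinearity checks are meant to surface.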

In [194]:
import seaborn as sns
import matplotlib.pyplot as plt

correlation_matrix = pd.DataFrame(df_train.iloc[:,1:-1]).corr()
plt.figure(figsize=(12,10))
ax = sns.heatmap(correlation_matrix, vmax=1, square=True, annot=True, fmt='.2f', cmap ='GnBu', cbar_kws={"shrink": .7}, robust=True)
#plt.xticks(np.arange(len(labels)), rotation=45)
#plt.yticks(np.arange(len(labels)), rotation=45)
plt.title('Correlation matrix between the features', fontsize=20)
plt.show()

2.5 Covariance Matrix

The values in the covariance matrix show the magnitude and direction of the joint variability of each pair of variables in the multivariate data.

Covariance indicates how two variables change together. If an increase in one variable is accompanied by an increase in the other, the two variables have a positive covariance; likewise, decreases in one accompany decreases in the other.

You can use the covariance to determine the direction of a linear relationship between two variables as follows:

  1. If both variables tend to increase or decrease together, the coefficient is positive.
  2. If one variable tends to increase as the other decreases, the coefficient is negative.
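A tiny worked example of the two sign rules, using np.cov with bias=True to match the cell that follows (the arrays here are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y_up = 2 * x      # moves with x      -> positive covariance
y_down = -2 * x   # moves against x   -> negative covariance

# off-diagonal entry [0, 1] is the covariance of the pair
print(np.cov(x, y_up, bias=True)[0, 1])    # 2.5
print(np.cov(x, y_down, bias=True)[0, 1])  # -2.5
```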
In [195]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# capture labels first
labels = df_train.iloc[:,1:-1].columns

# scaling the data vastly improves the covariance matrix results
scaler = StandardScaler()
df_scale = scaler.fit_transform(df_train.iloc[:,1:-1])

# transpose 8000x19 row/cols to 19x8000 row/cols
data = df_scale.transpose()

# calculate covariance matrix of all features
covMatrix = np.cov(data, bias=True)

plt.figure(figsize=(12,10))
sns.heatmap(covMatrix, annot=True, fmt='.2g', cbar_kws={"shrink": .7})
plt.xticks(np.arange(len(labels)), labels=labels, rotation=90)
plt.yticks(np.arange(len(labels)), labels=labels, rotation=0)
plt.title('Covariance matrix between the features', fontsize=20)
plt.show()

3. Prepare Data for model training

[3.1] Copy data for transformation for modelling steps

In [196]:
# Create a copy of df and save it into a variable called df_cleaned
df_cleaned = df_train.copy()
df_clean_test = df_test.copy()

[3.2] We need to drop the Id column as this is irrelevant for modelling

In [197]:
# Drop columns 'Id'
drop_cols = ['Id']   

df_cleaned.drop(drop_cols, axis=1, inplace=True)
df_clean_test.drop(drop_cols, axis=1, inplace=True)

[3.3] Remove leading and trailing space from the column names

In [198]:
df_cleaned.columns = df_cleaned.columns.str.strip()
df_clean_test.columns = df_clean_test.columns.str.strip()

[3.4] Extract the column TARGET_5Yrs and save it into variable called target

In [199]:
# Extract the column TARGET_5Yrs and save it into variable called target
target = df_cleaned.pop('TARGET_5Yrs')
print('df_cleaned.shape',df_cleaned.shape,'\n')

# we will need labels later for plotting results
labels = df_cleaned.columns 
df_cleaned.shape (8000, 19) 

Standardise the data, as it contains negative numbers

[3.5] Import StandardScaler from sklearn.preprocessing
[3.6] Instantiate the StandardScaler
[3.7] Fit and apply the scaling on df_cleaned

In [200]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

df_cleaned = scaler.fit_transform(df_cleaned)

# use transform (not fit_transform) on the test set, so it is scaled with the
# statistics learned from the training data rather than refitting the scaler
df_clean_test = scaler.transform(df_clean_test)

[3.8] Import dump from joblib
[3.9] Save the scaler into the folder models and call the file scaler.joblib

In [201]:
from joblib import dump
dump(scaler, '../models/scaler.joblib')
Out[201]:
['../models/scaler.joblib']

Split training, test, validation datasets

[3.10] Import train_test_split from sklearn.model_selection

[3.11] Randomly split the dataset with random_state=8 into 2 different sets: data (80%) and test (20%)

[3.12] Randomly split the remaining data (80%) with random_state=8 into 2 different sets: training (80%) and validation (20%)

In [202]:
from sklearn.model_selection import train_test_split

# Split randomly the dataset with random_state=8 into 2 different sets: data (80%) and test (20%)
X_data, X_test, y_data, y_test = train_test_split (df_cleaned, target, test_size=0.2, random_state=8)

# Split the remaining data (80%) randomly with random_state=8 into 2 different sets: training (80%) and validation (20%)
X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.2, random_state=8)

[3.13] Save the different sets in the folder data/processed

In [204]:
# save V2 for reduced features
np.save('../data/processed/X_train', X_train)
np.save('../data/processed/X_val',   X_val)
np.save('../data/processed/X_test',  X_test)
np.save('../data/processed/y_train', y_train)
np.save('../data/processed/y_val',   y_val)
np.save('../data/processed/y_test',  y_test)

# save the final Test data for submitting model results to Kaggle
np.save('../data/processed/final_test',  df_clean_test)

4. Get Baseline Model

[4.1] Calculate the average of the target variable for the training set and save it into a variable called y_mean

[4.2] Create a numpy array called y_base of dimensions (len(y_train), 1) filled with this value

In [205]:
# Calculate the average of the target variable for the training set
y_mean = y_train.mean()
print('y_mean',y_mean)

# Create a numpy array called `y_base` of dimensions (len(y_train), 1) filled with this value
y_base = np.full((len(y_train), 1), y_mean)
y_mean 0.837109375

[4.3] Import the MSE and MAE metrics from sklearn

[4.4] Display the RMSE and MAE scores of this baseline model

In [206]:
# Import the MSE and MAE metrics from sklearn
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as mae

# Display the RMSE and MAE scores of this baseline model
print('1. Baseline model scores - training data')
print('RMSE:',mse(y_train, y_base, squared=False))
print('MAE: ',mae(y_train, y_base))
1. Baseline model scores - training data
RMSE: 0.3692658517749907
MAE:  0.27271453857421873

5. Feature selection

Note: I have not used 5.1 or 5.2 in this experiment. However, they are here for experiment 2 future work.

I have incorporated 5.3 and 5.4 into the feature selection for Tests 2 and 3 in this experiment.

5.1 L1-based feature selection

Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero. When the goal is to reduce the dimensionality of the data to use with another classifier, they can be used along with SelectFromModel to select the non-zero coefficients. Sparse estimators useful for this purpose are the Lasso for regression and, for classification, LogisticRegression and LinearSVC. https://scikit-learn.org/stable/modules/feature_selection.html#l1-based-feature-selection

In [207]:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

X, y = X_data, y_data
print('LinearSVC')
print('X    ',X.shape)

lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)

X_new = model.transform(X)
print('X_new',X_new.shape)
LinearSVC
X     (6400, 19)
X_new (6400, 8)
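The cell above reports only the shapes. To see which columns were kept, SelectFromModel's get_support() returns a boolean mask that can index a label array. A sketch on synthetic data (the feature names f0 to f3 are made up; in the notebook you would fit on X_data/y_data and index the `labels` variable):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

# synthetic data where only features 0 and 2 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)
names = np.array(['f0', 'f1', 'f2', 'f3'])

lsvc = LinearSVC(C=0.1, penalty='l1', dual=False).fit(X, y)
mask = SelectFromModel(lsvc, prefit=True).get_support()
print('kept:', list(names[mask]))
```

The L1 penalty should drive the coefficients of the uninformative features to zero, so the mask keeps the signal-bearing columns.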

5.2 Tree-based feature selection

Tree-based estimators can be used to compute impurity-based feature importances, which in turn can be used to discard irrelevant features (when coupled with the SelectFromModel meta-transformer). https://scikit-learn.org/stable/modules/feature_selection.html#tree-based-feature-selection

In [208]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = X_data, y_data
print('ExtraTreesClassifier')
print(X.shape)
       
clf = ExtraTreesClassifier(n_estimators=50)
clf = clf.fit(X, y)
print(clf.feature_importances_)  

model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
print(X_new.shape)    
ExtraTreesClassifier
(6400, 19)
[0.073748   0.05129292 0.04944642 0.05234696 0.05137037 0.05760855
 0.04752286 0.0500017  0.05152289 0.05441841 0.05144485 0.05275767
 0.05081723 0.05139866 0.05275705 0.04896402 0.04926559 0.05097302
 0.05234281]
(6400, 5)

5.3 Feature selection using the Correlation metric

Select features according to the k highest scores.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html

In [209]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import mutual_info_regression
import matplotlib.pyplot as plt

# feature selection
f_selector = SelectKBest(score_func=f_regression, k='all')

# learn relationship from training data
f_selector.fit(X_train, y_train)

# transform train input data
X_train_fs = f_selector.transform(X_train)

# transform test input data
X_test_fs = f_selector.transform(X_test)

# Plot the scores for the features
plt.rcParams["figure.figsize"] = (7,5) 
plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_)
plt.xticks(np.arange(len(labels)),labels=labels, rotation=45)
plt.xlabel("feature index")
plt.ylabel("F-value (transformed from the correlation values)")
plt.show()

5.4 Feature selection using the Mutual Information metric

Estimate mutual information for a continuous target variable.

Mutual information (MI) between two random variables is a non-negative value that measures the dependency between the variables. It is equal to zero if and only if the two random variables are independent; higher values mean higher dependency.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html

In [210]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import mutual_info_regression
import matplotlib.pyplot as plt

# feature selection
f_selector = SelectKBest(score_func=mutual_info_regression, k='all')

# learn relationship from training data
f_selector.fit(X_train, y_train)

# transform train input data
X_train_fs = f_selector.transform(X_train)

# transform test input data
X_test_fs = f_selector.transform(X_test)

# Plot the scores for the features
plt.rcParams["figure.figsize"] = (7,5) 
plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_)
plt.xlabel("feature index")
plt.xticks(np.arange(len(labels)),labels=labels, rotation=45)
plt.ylabel("Estimated MI value")
plt.show()

6. Model Training, Testing and Evaluation

Training using Neural Network

  • Multi-layer Perceptron classifier.
  • Optimizes the log-loss function using LBFGS or stochastic gradient descent.

https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

[6.1] Train the model

In [211]:
from sklearn.neural_network import MLPClassifier

rerun_no = 1
classifier_name="NeuralNetworkMLP"
random_state=2
max_iter=300
activation='logistic'
solver='adam'     # sgd, adam (default)
alpha=0.01        # 0.0001 default
batch_size=100    

#--------------------------------------------------------------------------
t_start = time.process_time()
# fit model
classifier = MLPClassifier(activation=activation, solver=solver, alpha=alpha, 
                           batch_size=batch_size, random_state=random_state).fit(X_train, y_train)
t_end = time.process_time()       
t_diff = t_end - t_start

#--------------------------------------------------------------------------
# Baseline: predict the majority class (mode) for every observation
y_base = np.full((len(y_train), 1), y_train.mode()[0])

print("Compare accuracy between data sets")
print("Baseline:   ",accuracy_score(y_train, y_base))
print("Train data: ",classifier.score(X_train, y_train))
print("Validation: ",classifier.score(X_val, y_val))
Compare accuracy between data sets
Baseline:    0.837109375
Train data:  0.83828125
Validation:  0.8109375

[6.2] Save the fitted model into the folder 'models'

In [212]:
#--------------------------------------------------------------------------
# Save the fitted model into the folder 'models', named for each classifier
dump(classifier,  '../models/r{d}_{c}.joblib'.format(d=rerun_no, c=classifier_name))
Out[212]:
['../models/r1_NeuralNetworkMLP.joblib']

[6.3] Model evaluation and performance

In [213]:
from sklearn.metrics import roc_curve

# predict probabilities
pred_prob = classifier.predict_proba(X_test)

# roc curve for models
fpr, tpr, thresh = roc_curve(y_test, pred_prob[:,1], pos_label=1)

# roc curve for tpr = fpr 
random_probs = [0 for i in range(len(y_test))]
p_fpr, p_tpr, _ = roc_curve(y_test, random_probs, pos_label=1)
In [214]:
from sklearn.metrics import roc_auc_score

# auc scores
auc_score = roc_auc_score(y_test, pred_prob[:,1])

print('auc_score',auc_score)
auc_score 0.710822219840012
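The cell computing fpr/tpr above never draws the curve. A self-contained sketch (using synthetic scores, not the MLP's predictions) that plots the ROC and also picks the threshold maximising Youden's J = tpr - fpr, one common way to choose an operating point:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# synthetic labels and moderately informative probability scores
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)
scores = np.clip(y_true * 0.3 + rng.normal(0.4, 0.2, 500), 0, 1)

fpr, tpr, thresh = roc_curve(y_true, scores, pos_label=1)

# Youden's J picks the point furthest above the chance diagonal
best = np.argmax(tpr - fpr)
print('best threshold: {:.3f}'.format(thresh[best]))

plt.plot(fpr, tpr, label='model')
plt.plot([0, 1], [0, 1], '--', label='chance')
plt.scatter(fpr[best], tpr[best], label="Youden's J")
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
```

With the notebook's variables, the same plot would use the fpr/tpr already computed from pred_prob.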
In [215]:
from sklearn.metrics import plot_confusion_matrix
import matplotlib.pyplot as plt

print('Confusion matrix - training data')
plot_confusion_matrix(classifier, X_train, y_train, cmap=plt.cm.Blues, normalize='true');
Confusion matrix - training data
In [216]:
print('Confusion matrix - validation data')
plot_confusion_matrix(classifier, X_val, y_val, cmap=plt.cm.Blues, normalize='true');
Confusion matrix - validation data
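Note that plot_confusion_matrix was removed in scikit-learn 1.2; on newer versions the equivalent is ConfusionMatrixDisplay. A sketch on made-up labels (not the classifier above):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# toy true/predicted labels standing in for y_val / classifier predictions
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

# normalize='true' matches the row-normalised plots above
cm = confusion_matrix(y_true, y_pred, normalize='true')
ConfusionMatrixDisplay(cm, display_labels=[0, 1]).plot(cmap=plt.cm.Blues)
plt.show()
```

With a fitted estimator, `ConfusionMatrixDisplay.from_estimator(classifier, X_val, y_val, normalize='true')` is the drop-in replacement for the deprecated call.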

[6.4] Save the final "test" prediction probabilities for Kaggle

  • combine final "test" Id column with prediction probabilities column
  • Save the final predictions to CSV for submission to Kaggle

https://www.kaggle.com/c/uts-advdsi-22-02-nba-career-prediction/overview

In [217]:
#save the final "test" prediction probabilities for Kaggle
y_final_preds = classifier.predict_proba(df_clean_test)

# combine final "test" Id column with prediction probabilities column (convert to a DataFrame first)
frames = [df_test.iloc[:,0], pd.DataFrame(y_final_preds[:,1])]
result = pd.concat(frames, axis=1) 
result.columns = ['Id','tmp']
result['TARGET_5Yrs'] = [round(num, 2) for num in result['tmp']]
result.drop(['tmp'], axis=1, inplace=True)

#--------------------------------------------------------------------------
# Save the final predictions for submission to Kaggle
result.to_csv('../data/processed/group1_r{d}_{c}.csv'.format(d=rerun_no, c=classifier_name), index=False)
print('kaggle results saved ../data/processed/group1_r{d}_{c}.csv'.format(d=rerun_no, c=classifier_name))
kaggle results saved ../data/processed/group1_r1_NeuralNetworkMLP.csv

7. Push changes to GitHub

[7.1] Add changes to git staging area

[7.2] Create the snapshot of your repository and add a description

[7.3] Push your snapshot to Github

In [188]:
# Code saved here for easy reference, but do not run as code
# https://github.com/CazMayhem/adv_dsi_AT1

"""
    # Add changes to git staging area
    git add .

    # Create the snapshot of your repository and add a description
    git commit -m "assignment & kaggle submission"

    # Push your snapshot to Github
    git push https://******@github.com/CazMayhem/adv_dsi_AT1.git
"""    
Out[188]:
'\n    # Add changes to git staging area\n    git add .\n\n    # Create the snapshot of your repository and add a description\n    git commit -m "commit version 2"\n\n    # Push your snapshot to Github\n    git push https://******@github.com/CazMayhem/adv_dsi_AT1.git\n'

[7.4] Close Jupyter Lab with control (command) + c